{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ejercicio misceláneo: Feature Selection, Modelos predictivos"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. En un filtro antispam ¿prefeririría que sea más alta la presición o la exhaustividad? Justifique su respuesta.\n",
"\n",
"Asuma que el filtro elimina cualquier mensaje que detecte como spam."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"En este caso, **es más importante tener una alta precisión**. Porque una alta precisión significa que el número de falsos positivos es reducido. Es mucho peor tener un falso positivo que un falso negativo. Porque, digamos, si tengo un correo muy importante sobre trabajo y se lo marca como spam, es catastrófico. En cambio, si me llega un correo spam a mi bandeja de entrada, no es algo tan molesto."
]
},
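{
"cell_type": "markdown",
"metadata": {},
"source": [
"The trade-off argued above can be made concrete with a small, made-up example. The labels below are hypothetical (they are not taken from any data set in this notebook): `1` means spam and `0` means legitimate mail. The sketch computes both metrics with scikit-learn so that the two kinds of error can be counted separately."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix, precision_score, recall_score\n",
"\n",
"# Hypothetical labels for 10 messages: 1 = spam, 0 = legitimate mail.\n",
"y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]\n",
"# Hypothetical filter output: it misses two spam messages and deletes one legitimate message.\n",
"y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]\n",
"\n",
"tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()\n",
"precision = precision_score(y_true, y_pred)  # tp / (tp + fp)\n",
"recall = recall_score(y_true, y_pred)        # tp / (tp + fn)\n",
"\n",
"print('legitimate messages deleted (false positives):', fp)\n",
"print('spam messages delivered (false negatives):', fn)\n",
"print('precision:', precision)\n",
"print('recall:', recall)"
]
},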
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import os\n",
"\n",
"from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LogisticRegression"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Pregnancies \n",
" Glucose \n",
" BP \n",
" Skin Thickness \n",
" Insulin \n",
" Mass \n",
" Pedigree \n",
" Age \n",
" class \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 6 \n",
" 148 \n",
" 72 \n",
" 35 \n",
" 0 \n",
" 33.6 \n",
" 0.627 \n",
" 50 \n",
" 1 \n",
" \n",
" \n",
" 1 \n",
" 1 \n",
" 85 \n",
" 66 \n",
" 29 \n",
" 0 \n",
" 26.6 \n",
" 0.351 \n",
" 31 \n",
" 0 \n",
" \n",
" \n",
" 2 \n",
" 8 \n",
" 183 \n",
" 64 \n",
" 0 \n",
" 0 \n",
" 23.3 \n",
" 0.672 \n",
" 32 \n",
" 1 \n",
" \n",
" \n",
" 3 \n",
" 1 \n",
" 89 \n",
" 66 \n",
" 23 \n",
" 94 \n",
" 28.1 \n",
" 0.167 \n",
" 21 \n",
" 0 \n",
" \n",
" \n",
" 4 \n",
" 0 \n",
" 137 \n",
" 40 \n",
" 35 \n",
" 168 \n",
" 43.1 \n",
" 2.288 \n",
" 33 \n",
" 1 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Pregnancies Glucose BP Skin Thickness Insulin Mass Pedigree Age \\\n",
"0 6 148 72 35 0 33.6 0.627 50 \n",
"1 1 85 66 29 0 26.6 0.351 31 \n",
"2 8 183 64 0 0 23.3 0.672 32 \n",
"3 1 89 66 23 94 28.1 0.167 21 \n",
"4 0 137 40 35 168 43.1 2.288 33 \n",
"\n",
" class \n",
"0 1 \n",
"1 0 \n",
"2 1 \n",
"3 0 \n",
"4 1 "
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diabetes_df = pd.read_csv(os.path.join(\"datasets\",\"diabetes.csv\"))\n",
"diabetes_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Pregnancies',\n",
" 'Glucose',\n",
" 'BP',\n",
" 'Skin Thickness',\n",
" 'Insulin',\n",
" 'Mass',\n",
" 'Pedigree',\n",
" 'Age']"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"VARIABLES_INDEPENDIENTES = list(diabetes_df.columns[:-1])\n",
"VARIABLES_INDEPENDIENTES"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2 Feature selection\n",
"\n",
"En el siguiente problema de clasificación. Determine el porcentaje adecuado de variables aplicando la Información mutua. El valor a partir de cual un incremento en el mismo no incrementa significativamente el desempeño del clasificador (2% o menos)"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_selection import mutual_info_classif, SelectPercentile\n",
"\n",
"def seleccionar_variables(X, Y, porcentaje):\n",
" X_as_float = X.astype(np.float64)\n",
" select_features = SelectPercentile(mutual_info_classif, porcentaje)\n",
" X_new = select_features.fit_transform(X_as_float, Y)\n",
" return X_new"
]
},
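{
"cell_type": "markdown",
"metadata": {},
"source": [
"With only a handful of features, several percentile settings can end up keeping the same number of columns. The sketch below illustrates this on synthetic data, not on the diabetes data: `make_classification`, the sample size and the fixed `random_state` values are assumptions made for the example, and `functools.partial` is used only to make the mutual information estimate repeatable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from functools import partial\n",
"\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.feature_selection import SelectPercentile, mutual_info_classif\n",
"\n",
"# Synthetic stand-in with 8 features, the same number of predictors as the diabetes data.\n",
"X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)\n",
"\n",
"# Fix the random_state so the mutual information scores are identical on every call.\n",
"score_func = partial(mutual_info_classif, random_state=0)\n",
"\n",
"kept_per_percentile = {}\n",
"for percentile in range(1, 100, 10):\n",
"    selector = SelectPercentile(score_func, percentile=percentile)\n",
"    # Number of columns that survive the selection at this percentile.\n",
"    kept_per_percentile[percentile] = selector.fit_transform(X_demo, y_demo).shape[1]\n",
"\n",
"print(kept_per_percentile)"
]
},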
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"def datos_performance(performance_clasificador): \n",
" exactitud = [performance_clasificador[key]['exactitud'] for key in performance_clasificador]\n",
" nro_variables = list(range (1, 100, 10))\n",
" return pd.DataFrame(data={'exactitud' : exactitud}, index = nro_variables)"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score\n",
"\n",
"def evaluar_clasificador(variable_dependiente, \n",
" variables_independientes, \n",
" data_frame,\n",
" porcentaje):\n",
" \n",
" Y = data_frame[variable_dependiente] \n",
" X = seleccionar_variables(data_frame[variables_independientes], Y, porcentaje)\n",
" X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)\n",
" \n",
" clasificador = LogisticRegression(solver = 'liblinear')\n",
" clasificador.fit(X_train, Y_train)\n",
"\n",
" return {'exactitud': accuracy_score(Y_test, clasificador.predict(X_test)), \n",
" 'precision' : precision_score(Y_test, clasificador.predict(X_test)),\n",
" 'exhaustividad' : recall_score(Y_test, clasificador.predict(X_test))}"
]
},
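{
"cell_type": "markdown",
"metadata": {},
"source": [
"`evaluar_clasificador` selects features on the full data set before splitting, and it scores each percentile on a single random split, so part of the variation between percentiles is split noise. One possible alternative, shown here as a sketch rather than as the method used in this exercise, is to place the selector inside a `Pipeline` and average the accuracy over cross-validation folds. Synthetic data stands in for the diabetes data; the helper name `exactitud_media`, the fold count and the `random_state` values are assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from functools import partial\n",
"\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.feature_selection import SelectPercentile, mutual_info_classif\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.pipeline import Pipeline\n",
"\n",
"# Synthetic stand-in for the diabetes data (8 features, binary target).\n",
"X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)\n",
"\n",
"def exactitud_media(porcentaje, X, y, cv=5):\n",
"    # Hypothetical helper: mean cross-validated accuracy for one percentile.\n",
"    # The selector is refit on the training part of every fold.\n",
"    pipeline = Pipeline([\n",
"        ('seleccion', SelectPercentile(partial(mutual_info_classif, random_state=0), percentile=porcentaje)),\n",
"        ('clasificador', LogisticRegression(solver='liblinear')),\n",
"    ])\n",
"    return cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy').mean()\n",
"\n",
"medias = {p: exactitud_media(p, X_demo, y_demo) for p in range(1, 100, 10)}\n",
"print(medias)"
]
},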
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"performance_clasificador = {}\n",
"\n",
"for i in range (1, 100, 10):\n",
" performance_clasificador['mutual_info_classif percentile - ' + str(i)] = evaluar_clasificador('class',\n",
" VARIABLES_INDEPENDIENTES,\n",
" diabetes_df,\n",
" i)"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" exactitud \n",
" \n",
" \n",
" \n",
" \n",
" 1 \n",
" 0.772727 \n",
" \n",
" \n",
" 11 \n",
" 0.779221 \n",
" \n",
" \n",
" 21 \n",
" 0.746753 \n",
" \n",
" \n",
" 31 \n",
" 0.766234 \n",
" \n",
" \n",
" 41 \n",
" 0.720779 \n",
" \n",
" \n",
" 51 \n",
" 0.740260 \n",
" \n",
" \n",
" 61 \n",
" 0.746753 \n",
" \n",
" \n",
" 71 \n",
" 0.792208 \n",
" \n",
" \n",
" 81 \n",
" 0.766234 \n",
" \n",
" \n",
" 91 \n",
" 0.746753 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" exactitud\n",
"1 0.772727\n",
"11 0.779221\n",
"21 0.746753\n",
"31 0.766234\n",
"41 0.720779\n",
"51 0.740260\n",
"61 0.746753\n",
"71 0.792208\n",
"81 0.766234\n",
"91 0.746753"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = datos_performance(performance_clasificador)\n",
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"df.plot.line(rot=0)"
]
},
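{
"cell_type": "markdown",
"metadata": {},
"source": [
"The step-to-step change in accuracy can be read off the results table above with `Series.diff`. The values below are copied from that table as displayed; `cambio` is a hypothetical name introduced for this sketch. Because each value comes from one random split, rerunning the notebook would give different numbers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Accuracy per percentile, copied from the results table in this notebook.\n",
"exactitud = pd.Series(\n",
"    [0.772727, 0.779221, 0.746753, 0.766234, 0.720779,\n",
"     0.740260, 0.746753, 0.792208, 0.766234, 0.746753],\n",
"    index=range(1, 100, 10),\n",
")\n",
"\n",
"# Change with respect to the previous percentile (the first entry has no predecessor, so it is NaN).\n",
"cambio = exactitud.diff()\n",
"\n",
"print('best percentile:', exactitud.idxmax())\n",
"print('change from 61 to 71:', round(cambio[71], 4))\n",
"print('change from 71 to 81:', round(cambio[81], 4))"
]
},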
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**El porcentaje adecuado de variables es el 71%**. Con ese número de variables se tiene la mayor exactitud. Además, si se incrementa el valor (81%), se observa que la exactitud reduce. Y este valor tiene una diferencia de más del 2 por ciento que el valor inmediatamente anterior (61%)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}